Optimal String Mining Under Frequency Constraints

Authors

  • Johannes Fischer
  • Volker Heun
  • Stefan Kramer
Abstract

We propose a new algorithmic framework that solves frequency-related data mining queries on databases of strings in optimal time, i.e., in time linear in the input and the output size. The additional space is linear in the input size. Our framework can be used to mine frequent strings, emerging strings, and strings that pass other statistical tests, e.g., the χ²-test. In contrast to the presented result for strings, no optimal algorithms are known for other pattern domains such as itemsets. Our approach rests on several recent results on index structures for strings, among them suffix and LCP arrays, and on a new preprocessing scheme for range minimum queries. The advantages of array-based data structures (compared with dynamic data structures such as trees) are good locality behavior and extensibility to secondary memory. We test our algorithm on real-world data from computational biology and demonstrate that the approach also works well in practice.
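To make the index structures named in the abstract concrete, here is a minimal Python sketch of a suffix array, an LCP array, and a frequency (support) query over a small string database. It is an illustration only and not the authors' optimal algorithm: the suffix array is built by naive sorting, the query scans a truncated key list instead of using range minimum queries, and the names `suffix_array`, `lcp_array`, `build_index`, and `support`, as well as the `"\x00"` separator, are assumptions introduced for this example.

```python
from bisect import bisect_left, bisect_right


def suffix_array(s):
    # Naive construction by sorting suffixes (not linear time).
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    # Kasai-style computation: lcp[i] is the length of the longest common
    # prefix of the suffixes starting at sa[i-1] and sa[i]; lcp[0] is 0.
    n = len(s)
    rank = [0] * n
    for i, p in enumerate(sa):
        rank[p] = i
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def build_index(db):
    # Concatenate all strings, each followed by a separator that is assumed
    # not to occur in the data, and remember which string owns each position.
    text = ""
    owner = []
    for k, s in enumerate(db):
        text += s + "\x00"
        owner += [k] * (len(s) + 1)
    return text, owner, suffix_array(text)


def support(pattern, index):
    # Number of database strings that contain `pattern` as a substring.
    # Truncating sorted suffixes to len(pattern) keeps them sorted, so all
    # matches form one contiguous interval of the suffix array.
    text, owner, sa = index
    m = len(pattern)
    keys = [text[p:p + m] for p in sa]
    lo = bisect_left(keys, pattern)
    hi = bisect_right(keys, pattern)
    return len({owner[sa[i]] for i in range(lo, hi)})


if __name__ == "__main__":
    sa = suffix_array("banana")
    print(sa, lcp_array("banana", sa))
    index = build_index(["abab", "bab", "ca"])
    print(support("ab", index), support("a", index))
```

A frequent-string query then amounts to keeping the patterns whose `support` meets a chosen minimum threshold. An emerging-string query would compare supports computed against two separate databases.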


Similar resources

Efficient String Mining under Constraints Via the Deferred Frequency Index

We propose a general approach for frequency-based string mining, which has many applications, e.g., in contrast data mining. Our contribution is a novel algorithm based on a deferred data structure. Despite its simplicity, our approach is up to 4 times faster and uses about half the memory compared to the best-known algorithm of Fischer et al. Applications in various string domains, e.g. natural...

Impact of Pollution Location on Time and Frequency Characteristics of Leakage Current of Porcelain Insulator String under Different Humidity and Contamination Severity

One of the important factors influencing the performance of outdoor insulators is pollution. Pollution, especially under humid conditions, reduces the surface resistance of the insulator and leads to a flow of leakage currents (LC) on the insulator surface, which may result in total flashover. The LC characteristics are affected by parameters such as the nature and severity of the pollution. Loca...

Introducing Softness into Inductive Queries on String Databases

In many application domains (e.g., WWW mining, molecular biology), large string datasets are available and yet under-exploited. The inductive database framework assumes that both such datasets and the various patterns holding within them might be queryable. In this setting, queries which return patterns are called inductive queries and solving them is one of the core research topics for data mi...

Mitašiūnaitė Mining String Data under Similarity and Soft-Frequency Constraints: Application to Promoter Sequence Analysis

An inductive database is a database that contains not only data but also patterns. Inductive databases are designed to support the KDD process. Recent advances in inductive database research have given rise to generic solvers capable of solving inductive queries that are arbitrary Boolean combinations of anti-monotonic and monotonic constraints. They are designed to mine different types of p...

Optimal production strategy of bimetallic deposits under technical and economic uncertainties using stochastic chance-constrained programming

In order to catch up with reality, all the macro-decisions related to long-term mining production planning must be made simultaneously and under uncertain conditions of determinant parameters. By taking advantage of the chance-constrained programming, this paper presents a stochastic model to create an optimal strategy for producing bimetallic deposit open-pit mines under certain and uncertain ...

Publication date: 2006